Data Science and Business Analytics

Practice Project III

Prediction of Medical Insurance Charges

By

Hayford Osumanu

December 2022

Problem Statement

Context:

Many of the factors that affect how much you pay for health insurance are not within your control. Nonetheless, it is good to understand what they are. The dataset records the following factors that affect the cost of health insurance premiums:

age: age of primary beneficiary

sex: gender of the insurance contractor (female or male)

bmi: body mass index, an objective index of body weight relative to height (kg / m ^ 2), computed as weight divided by height squared; it indicates whether a weight is relatively high or low for a given height, and is ideally between 18.5 and 24.9

children: number of children covered by health insurance / number of dependents

smoker: whether the beneficiary smokes

region: the beneficiary's residential area in the US (northeast, southeast, southwest, northwest)

Objective of the Project

In this project, we use machine learning in Python to extract important insights from a dataset that describes the background of people purchasing medical insurance, along with the premium charged to each of them.

Importing necessary libraries

Let's start by importing the libraries we need.

Python Library Functions

Python libraries make it easy to handle the data and perform both routine and complex tasks, often with a single line of code.
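A typical import cell for this kind of workflow might look like the sketch below. The exact imports of the original notebook are not shown, so this selection is an assumption based on the sections that follow. seaborn and xgboost are also commonly imported for the plots and the XGBoost model, but they are left out here to keep the sketch minimal.

```python
# Data handling
import numpy as np
import pandas as pd

# Plotting
import matplotlib.pyplot as plt

# Model building, tuning, and evaluation with scikit-learn
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import (
    AdaBoostRegressor,
    GradientBoostingRegressor,
    RandomForestRegressor,
    StackingRegressor,
)
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
```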

Reading the dataset
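Reading the data could be sketched as follows. The helper name `load_insurance`, the column name `charges`, and the two placeholder rows are assumptions made for illustration; the rows are made up and are not taken from the project dataset.

```python
import io

import pandas as pd

# Column names assumed from the variable list in the problem statement,
# plus an assumed target column called "charges".
EXPECTED_COLUMNS = ["age", "sex", "bmi", "children", "smoker", "region", "charges"]


def load_insurance(source):
    """Read the insurance CSV and check that the expected columns are present."""
    df = pd.read_csv(source)
    missing = [col for col in EXPECTED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    return df


# Two made-up rows stand in for the real file so the sketch runs on its own.
# In the notebook, pass the path to the actual CSV file instead.
placeholder_csv = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "30,female,25.0,1,no,northeast,4000.0\n"
    "45,male,31.5,2,yes,southeast,30000.0\n"
)
data = load_insurance(placeholder_csv)
print(data.shape)
```

Checking the columns right after loading makes later steps fail early and clearly if the file has a different layout.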

Overview of the dataset

View the first 5 rows of the dataset

Check data types and number of non-null values for each column

Summary of the dataset

Number of unique values in each column

Number of observations in each category
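The overview steps above can be sketched with standard pandas calls. The frame below is synthetic placeholder data generated only so the example runs; its values, the `yes`/`no` coding of `smoker`, and the `charges` column name are assumptions, not the project dataset.

```python
import numpy as np
import pandas as pd

# Synthetic placeholder data (NOT the project dataset)
rng = np.random.default_rng(0)
n = 200
data = pd.DataFrame({
    "age": rng.integers(18, 65, n),
    "sex": rng.choice(["female", "male"], n),
    "bmi": rng.normal(30, 6, n).round(1),
    "children": rng.integers(0, 6, n),
    "smoker": rng.choice(["yes", "no"], n, p=[0.2, 0.8]),
    "region": rng.choice(["northeast", "southeast", "southwest", "northwest"], n),
})
data["charges"] = (
    250 * data["age"] + 300 * data["bmi"]
    + 20000 * (data["smoker"] == "yes") + rng.normal(0, 2000, n)
).round(2)

print(data.head())          # first 5 rows
data.info()                 # data types and non-null counts per column
print(data.describe().T)    # statistical summary of the numerical columns
print(data.nunique())       # number of unique values in each column

# Number of observations in each category of the non-numeric columns
cat_cols = data.select_dtypes(exclude="number").columns
for col in cat_cols:
    print(data[col].value_counts())
```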

Exploratory Data Analysis (EDA) Summary

Statistical summary of the numerical columns in both train and test dataset

Statistical Summary of the Dataset

The following functions need to be defined to carry out the EDA.
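The helper functions themselves are not shown here. A minimal sketch of one such helper, a combined boxplot and histogram for univariate analysis, is given below. The name `histogram_boxplot`, its parameters, and the synthetic values are hypothetical.

```python
import matplotlib.pyplot as plt
import numpy as np


def histogram_boxplot(values, title, bins=20):
    """Draw a boxplot above a histogram on a shared x-axis and return the figure."""
    fig, (ax_box, ax_hist) = plt.subplots(
        nrows=2,
        sharex=True,
        figsize=(8, 5),
        gridspec_kw={"height_ratios": (0.25, 0.75)},
    )
    ax_box.boxplot(values, vert=False)   # horizontal boxplot on top
    ax_box.set_yticks([])
    ax_hist.hist(values, bins=bins)      # histogram below
    ax_hist.axvline(np.mean(values), color="green", linestyle="--", label="mean")
    ax_hist.axvline(np.median(values), color="black", linestyle="-", label="median")
    ax_hist.legend()
    fig.suptitle(title)
    return fig


# Synthetic values stand in for a numerical column such as bmi
rng = np.random.default_rng(0)
fig = histogram_boxplot(rng.normal(30, 6, 200), "bmi")
```

Returning the figure instead of showing it inside the function leaves it to the caller to display or save the plot.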

Univariate analysis

Bivariate analysis

Correlation Check
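A correlation check on the numerical columns could look like the sketch below. The data is synthetic and only illustrates the calls; the resulting numbers say nothing about the project dataset.

```python
import numpy as np
import pandas as pd

# Synthetic numerical columns (NOT the project dataset)
rng = np.random.default_rng(0)
n = 200
data = pd.DataFrame({
    "age": rng.integers(18, 65, n),
    "bmi": rng.normal(30, 6, n),
    "children": rng.integers(0, 6, n),
})
data["charges"] = 250 * data["age"] + 300 * data["bmi"] + rng.normal(0, 2000, n)

# Pearson correlation matrix of all numerical columns
corr = data.corr()
print(corr.round(2))

# Features ranked by the absolute value of their correlation with the target
target_corr = corr["charges"].drop("charges").sort_values(key=abs, ascending=False)
print(target_corr.round(2))
```

The matrix can be passed to a heatmap function (for example in seaborn) to get the usual visual check.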

Part III: EDA - Multivariate Data Analysis

Boxplot Comparison Analysis
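The comparisons in this part are shown as boxplots. As a numeric complement, the same groupings can be tabulated with `groupby`, as in the sketch below. The data is synthetic, and the `yes`/`no` coding of `smoker` is an assumption.

```python
import numpy as np
import pandas as pd

# Synthetic placeholder data (NOT the project dataset)
rng = np.random.default_rng(0)
n = 200
data = pd.DataFrame({
    "smoker": rng.choice(["yes", "no"], n, p=[0.2, 0.8]),
    "region": rng.choice(["northeast", "southeast", "southwest", "northwest"], n),
})
data["charges"] = 10000 + 20000 * (data["smoker"] == "yes") + rng.normal(0, 2000, n)

# Quartiles, mean, and spread of charges per smoker group:
# the same quantities a boxplot displays
summary = data.groupby("smoker")["charges"].describe()
print(summary.round(0))

# Median charges for every region and smoker combination
medians = data.pivot_table(
    index="region", columns="smoker", values="charges", aggfunc="median"
)
print(medians.round(0))
```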

Comparison of the Numerical Columns

BMI in Relation to Region

BMI in Relation to Smoker

BMI in Relation to Sex

Charges in Relation to Region

Charges in Relation to Smoker

Charges in Relation to Sex

Age in Relation to Region

Age in Relation to Smoker

Age in Relation to Sex

Number of Children in Relation to Region

Number of Children in Relation to Smoker

Number of Children in Relation to Sex

General Observation

General Observation

General Observation

Correlation and Pairplot Analysis

Data Preprocessing

Outlier Detection and Treatment
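The treatment actually applied in the notebook is not shown. One common approach, sketched below as an assumption, flags values beyond 1.5 times the interquartile range (IQR) and caps them at the whisker limits. The function names and the synthetic `bmi` values are hypothetical.

```python
import numpy as np
import pandas as pd


def iqr_bounds(series, k=1.5):
    """Return the lower and upper whisker limits based on the IQR rule."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def cap_outliers(series, k=1.5):
    """Clip values outside the whisker limits to those limits."""
    lower, upper = iqr_bounds(series, k)
    return series.clip(lower=lower, upper=upper)


# Synthetic bmi values with two extreme points appended (NOT the project dataset)
rng = np.random.default_rng(0)
bmi = pd.Series(np.append(rng.normal(30, 6, 200), [5.0, 60.0]), name="bmi")

lower, upper = iqr_bounds(bmi)
n_outliers = int(((bmi < lower) | (bmi > upper)).sum())
capped = cap_outliers(bmi)
print(f"limits: ({lower:.1f}, {upper:.1f}), outliers flagged: {n_outliers}")
```

Capping keeps every row in the data, which matters on a small dataset; dropping the rows is the usual alternative.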

Data Preparation for Model Building
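Preparing the data usually means separating the target, encoding the categorical columns, and splitting into train and test sets. The sketch below assumes a target column named `charges`, a 70/30 split, and synthetic placeholder data; none of these are confirmed by the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic placeholder data (NOT the project dataset)
rng = np.random.default_rng(0)
n = 200
data = pd.DataFrame({
    "age": rng.integers(18, 65, n),
    "sex": rng.choice(["female", "male"], n),
    "bmi": rng.normal(30, 6, n).round(1),
    "children": rng.integers(0, 6, n),
    "smoker": rng.choice(["yes", "no"], n, p=[0.2, 0.8]),
    "region": rng.choice(["northeast", "southeast", "southwest", "northwest"], n),
})
data["charges"] = 250 * data["age"] + 20000 * (data["smoker"] == "yes")

# Separate the features from the target
X = data.drop(columns="charges")
y = data["charges"]

# One-hot encode the categorical columns; drop_first avoids redundant columns
X = pd.get_dummies(X, columns=["sex", "smoker", "region"], drop_first=True, dtype=int)

# Hold out 30% of the rows for testing (the split ratio is an assumption)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)
print(X_train.shape, X_test.shape)
```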

Bagging - Model Building and Hyperparameter Tuning

Decision Tree Model

Hyperparameter Tuning
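Tuning a decision tree could be sketched with a grid search, as below. The parameter grid, the R-squared scoring choice, and the synthetic data (with `smoker` already encoded as 0/1) are assumptions, not the settings used in the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic placeholder data (NOT the project dataset)
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "age": rng.integers(18, 65, n),
    "bmi": rng.normal(30, 6, n),
    "smoker": rng.integers(0, 2, n),
})
y = 250 * X["age"] + 300 * X["bmi"] + 20000 * X["smoker"] + rng.normal(0, 2000, n)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Small illustrative grid; a real search would usually cover more values
param_grid = {"max_depth": [2, 4, 6], "min_samples_leaf": [1, 5, 10]}

grid = GridSearchCV(
    DecisionTreeRegressor(random_state=1), param_grid, scoring="r2", cv=3
)
grid.fit(X_train, y_train)

# The best estimator is refit on the whole training set by default
best_tree = grid.best_estimator_
test_r2 = r2_score(y_test, best_tree.predict(X_test))
print(grid.best_params_, round(test_r2, 3))
```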

Plotting the feature importance of each variable
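Feature importances of a fitted tree can be collected into a sorted pandas Series before plotting. This is a sketch on synthetic data, so the ranking it prints is not a finding about the project dataset.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Synthetic placeholder data (NOT the project dataset)
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "age": rng.integers(18, 65, n),
    "bmi": rng.normal(30, 6, n),
    "smoker": rng.integers(0, 2, n),
})
y = 250 * X["age"] + 300 * X["bmi"] + 20000 * X["smoker"] + rng.normal(0, 2000, n)

tree = DecisionTreeRegressor(max_depth=4, random_state=1).fit(X, y)

# Impurity-based importances, one per feature, normalized to sum to 1
importances = pd.Series(tree.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances.round(3))
```

Calling `importances.plot.barh()` on the result gives the usual horizontal bar chart.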

Random Forest Model

Hyperparameter Tuning

Boosting - Model Building and Hyperparameter Tuning

AdaBoost Regressor

Hyperparameter Tuning

Gradient Boosting Regressor

Hyperparameter Tuning
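The two scikit-learn boosting models can be tuned in the same way as the tree models. The sketch below runs a small grid search for each; the grids and the synthetic data are assumptions chosen to keep the example quick.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import AdaBoostRegressor, GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic placeholder data (NOT the project dataset)
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "age": rng.integers(18, 65, n),
    "bmi": rng.normal(30, 6, n),
    "smoker": rng.integers(0, 2, n),
})
y = 250 * X["age"] + 300 * X["bmi"] + 20000 * X["smoker"] + rng.normal(0, 2000, n)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# One small illustrative search per boosting model
searches = {
    "AdaBoost": GridSearchCV(
        AdaBoostRegressor(random_state=1),
        {"n_estimators": [25, 50], "learning_rate": [0.1, 1.0]},
        scoring="r2", cv=3,
    ),
    "Gradient Boosting": GridSearchCV(
        GradientBoostingRegressor(random_state=1),
        {"n_estimators": [50, 100], "max_depth": [2, 3]},
        scoring="r2", cv=3,
    ),
}

scores = {}
for name, search in searches.items():
    search.fit(X_train, y_train)
    # score() applies the chosen metric (R-squared) to the refit best estimator
    scores[name] = search.score(X_test, y_test)
    print(name, search.best_params_, round(scores[name], 3))
```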

XGBoost Regressor

Hyperparameter Tuning

Stacking Model

Now, let's build a stacking model that uses the tuned decision tree, random forest, and gradient boosting models as base estimators, with XGBoost as the final estimator that produces the prediction.
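A stacking setup of this shape can be sketched with scikit-learn's `StackingRegressor`. To keep the example free of extra dependencies, a `GradientBoostingRegressor` stands in for XGBoost as the final estimator here; with xgboost installed, `xgboost.XGBRegressor()` can be passed instead. The helper name `build_stack`, the hyperparameters, and the data are hypothetical, and the base models are untuned.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    StackingRegressor,
)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def build_stack(final_estimator):
    """Stack three base regressors under the given final estimator."""
    base_estimators = [
        ("decision_tree", DecisionTreeRegressor(max_depth=4, random_state=1)),
        ("random_forest", RandomForestRegressor(n_estimators=50, random_state=1)),
        ("gradient_boosting", GradientBoostingRegressor(random_state=1)),
    ]
    # The final estimator is trained on cross-validated base-model predictions
    return StackingRegressor(
        estimators=base_estimators, final_estimator=final_estimator, cv=3
    )


# Synthetic placeholder data (NOT the project dataset)
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "age": rng.integers(18, 65, n),
    "bmi": rng.normal(30, 6, n),
    "smoker": rng.integers(0, 2, n),
})
y = 250 * X["age"] + 300 * X["bmi"] + 20000 * X["smoker"] + rng.normal(0, 2000, n)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Stand-in final estimator; swap in xgboost.XGBRegressor() to match the text
stack = build_stack(GradientBoostingRegressor(n_estimators=50, random_state=1))
stack.fit(X_train, y_train)
stack_r2 = stack.score(X_test, y_test)
print(round(stack_r2, 3))
```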

Comparing all models
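One way to compare the models is to compute the same metrics for each on the test set and collect them in a single table. The metric set (RMSE, MAE, R-squared), the helper name `regression_scores`, and the two untuned models below are assumptions for illustration, fitted on synthetic data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def regression_scores(model, X, y):
    """Return RMSE, MAE, and R-squared of a fitted model on the given data."""
    pred = model.predict(X)
    return {
        "RMSE": np.sqrt(mean_squared_error(y, pred)),
        "MAE": mean_absolute_error(y, pred),
        "R2": r2_score(y, pred),
    }


# Synthetic placeholder data (NOT the project dataset)
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "age": rng.integers(18, 65, n),
    "bmi": rng.normal(30, 6, n),
    "smoker": rng.integers(0, 2, n),
})
y = 250 * X["age"] + 300 * X["bmi"] + 20000 * X["smoker"] + rng.normal(0, 2000, n)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

models = {
    "Decision Tree": DecisionTreeRegressor(max_depth=4, random_state=1),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=1),
}

rows = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    rows[name] = regression_scores(model, X_test, y_test)

# One row per model, best R-squared first
comparison = pd.DataFrame.from_dict(rows, orient="index")
comparison = comparison.sort_values("R2", ascending=False)
print(comparison.round(3))
```

Computing the same table on the training set as well makes overfitting visible as a gap between the train and test scores.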

Conclusions and Business Recommendations
